1. Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
In [2]:
price1999 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-1990-1999.csv')
price2012 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
price2014 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
price2016 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
price2017 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')
cpi = pd.read_csv('Data/CPI.csv')
In [3]:
price1999.head()
Out[3]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 9000
1 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 IMPROVED 1977 6000
2 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 8000
3 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 IMPROVED 1977 6000
4 1990-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 NEW GENERATION 1976 47200
In [4]:
price2012.head()
Out[4]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 2000-01 ANG MO KIO 3 ROOM 170 ANG MO KIO AVE 4 07 TO 09 69.0 Improved 1986 147000.0
1 2000-01 ANG MO KIO 3 ROOM 174 ANG MO KIO AVE 4 04 TO 06 61.0 Improved 1986 144000.0
2 2000-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 07 TO 09 73.0 New Generation 1976 159000.0
3 2000-01 ANG MO KIO 3 ROOM 215 ANG MO KIO AVE 1 07 TO 09 73.0 New Generation 1976 167000.0
4 2000-01 ANG MO KIO 3 ROOM 218 ANG MO KIO AVE 1 07 TO 09 67.0 New Generation 1976 163000.0
In [5]:
price2014.head()
Out[5]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 2012-03 ANG MO KIO 2 ROOM 172 ANG MO KIO AVE 4 06 TO 10 45.0 Improved 1986 250000.0
1 2012-03 ANG MO KIO 2 ROOM 510 ANG MO KIO AVE 8 01 TO 05 44.0 Improved 1980 265000.0
2 2012-03 ANG MO KIO 3 ROOM 610 ANG MO KIO AVE 4 06 TO 10 68.0 New Generation 1980 315000.0
3 2012-03 ANG MO KIO 3 ROOM 474 ANG MO KIO AVE 10 01 TO 05 67.0 New Generation 1984 320000.0
4 2012-03 ANG MO KIO 3 ROOM 604 ANG MO KIO AVE 5 06 TO 10 67.0 New Generation 1980 321000.0
In [6]:
price2016.head()
Out[6]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date remaining_lease resale_price
0 2015-01 ANG MO KIO 3 ROOM 174 ANG MO KIO AVE 4 07 TO 09 60.0 Improved 1986 70 255000.0
1 2015-01 ANG MO KIO 3 ROOM 541 ANG MO KIO AVE 10 01 TO 03 68.0 New Generation 1981 65 275000.0
2 2015-01 ANG MO KIO 3 ROOM 163 ANG MO KIO AVE 4 01 TO 03 69.0 New Generation 1980 64 285000.0
3 2015-01 ANG MO KIO 3 ROOM 446 ANG MO KIO AVE 10 01 TO 03 68.0 New Generation 1979 63 290000.0
4 2015-01 ANG MO KIO 3 ROOM 557 ANG MO KIO AVE 10 07 TO 09 68.0 New Generation 1980 64 290000.0
In [7]:
# Merge dfs
prices = pd.concat([price1999, price2012, price2014], sort=False)
prices = pd.concat([prices, price2016, price2017], axis=0, ignore_index=True, sort=False)

prices['month'] = pd.to_datetime(prices['month']) # to datetime

prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 826581 entries, 0 to 826580
Data columns (total 11 columns):
month                  826581 non-null datetime64[ns]
town                   826581 non-null object
flat_type              826581 non-null object
block                  826581 non-null object
street_name            826581 non-null object
storey_range           826581 non-null object
floor_area_sqm         826581 non-null float64
flat_model             826581 non-null object
lease_commence_date    826581 non-null int64
resale_price           826581 non-null float64
remaining_lease        117527 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 69.4+ MB
In [8]:
prices[~prices.isnull().any(axis=1)]['month'].dt.year.unique()
Out[8]:
array([2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

remaining_lease has many NAs: it is only recorded for sales from 2015 onwards.

2. Data Cleaning

In [9]:
# Clean flat type
prices['flat_type'] = prices['flat_type'].str.replace('MULTI-GENERATION', 'MULTI GENERATION')
prices['flat_type'].unique()
Out[9]:
array(['1 ROOM', '3 ROOM', '4 ROOM', '5 ROOM', '2 ROOM', 'EXECUTIVE',
       'MULTI GENERATION'], dtype=object)
In [10]:
# Rename flat model duplicates
replace_values = {'NEW GENERATION':'New Generation', 'SIMPLIFIED':'Simplified', 'STANDARD':'Standard', 'MODEL A-MAISONETTE':'Maisonette', 'MULTI GENERATION':'Multi Generation', 'IMPROVED-MAISONETTE':'Executive Maisonette', 'Improved-Maisonette':'Executive Maisonette', 'Premium Maisonette':'Executive Maisonette', '2-ROOM':'2-room', 'MODEL A':'Model A', 'MAISONETTE':'Maisonette', 'Model A-Maisonette':'Maisonette', 'IMPROVED':'Improved', 'TERRACE':'Terrace', 'PREMIUM APARTMENT':'Premium Apartment', 'Premium Apartment Loft':'Premium Apartment', 'APARTMENT':'Apartment', 'Type S1':'Type S1S2', 'Type S2':'Type S1S2'}

prices = prices.replace({'flat_model': replace_values})

prices['flat_model'].value_counts()
Out[10]:
Model A                 228389
Improved                217356
New Generation          177570
Simplified               53960
Standard                 39854
Premium Apartment        35066
Apartment                32004
Maisonette               28798
Model A2                  9109
DBSS                      1609
Adjoined flat             1085
Terrace                    642
Multi Generation           502
Type S1S2                  401
Executive Maisonette       196
2-room                      40
Name: flat_model, dtype: int64

Types of Flat Models:

Standard: (1/2/3/4/5-room). 1960s HDB flats, with the WC and shower in the same room. 5-room Standard flats were introduced in 1974.
Improved: (1/2/3/4/5-room). Introduced in 1966; the 3/4-room units have a separate WC and shower, and the blocks also featured void decks. 5-room Improved flats were introduced in 1974.
New Generation: First built in 1975, New Generation flats come as 3-Room (67 / 82 sqm) or 4-Room (92 sqm), featuring an en-suite toilet for the master bedroom with a pedestal-type water closet, plus a store room.
Model A: Introduced in 1981: 3-Room (75 sqm), 4-Room (105 sqm), 5-Room (135 sqm), 5-Room Maisonette (139 sqm)
Model A2: Smaller units of Model A, e.g., 4-Room Model A2 (90 sqm)
Simplified: Introduced in 1984: 3-Room (64 sqm), 4-Room (84 sqm)
Multi Generation: 3Gen flats designed to meet the needs of multi-generation families.
Maisonette: AKA Model A Maisonette, a 2-storey HDB flat
Premium Apartment: Introduced during the 1990s, featuring better-quality finishes; sold in move-in condition with flooring, kitchen cabinets and built-in wardrobes
Executive Maisonette: A more premium version of the Model A Maisonette. These units are no longer built, having been superseded by the Executive Condominium (EC) scheme in 1995
Executive Apartment: Executive Apartments / Maisonettes (146-150 sqm) were introduced in 1983 and replaced 5-Room Model A flats. In addition to the 3 bedrooms and separate living/dining areas found in 5A flats, EA and EM units feature a utility/maid room. 80% of Executive units were Maisonettes and 20% were Apartments.
DBSS: public apartments built under HDB's short-lived Design, Build and Sell Scheme (DBSS) from 2005 to 2012. A unique (and premium) breed of HDB flats, built by private developers and sold at high prices. Quite similar to Executive Condominiums, except that a DBSS flat is like a premium HDB flat without the facilities of a private condo
Adjoined Flat: Large HDB flats formed by combining 2 HDB flats
Terrace: Landed terrace flats built in the early days of public housing, before Singapore's land constraint was fully appreciated. Discontinued
Type S1S2: Apartments at The Pinnacle@Duxton, classified as "S" or Special apartments in view of their historical significance and award-winning design. For the application of HDB policies, S1 and S2 apartments are treated as 4-room and 5-room flats respectively
2-room: Most likely refers to 2-room Flexi flats, which have 1 bedroom and 1 common area

In [11]:
prices['storey_range'].unique()
Out[11]:
array(['10 TO 12', '04 TO 06', '07 TO 09', '01 TO 03', '13 TO 15',
       '19 TO 21', '16 TO 18', '25 TO 27', '22 TO 24', '28 TO 30',
       '31 TO 33', '40 TO 42', '37 TO 39', '34 TO 36', '06 TO 10',
       '01 TO 05', '11 TO 15', '16 TO 20', '21 TO 25', '26 TO 30',
       '36 TO 40', '31 TO 35', '46 TO 48', '43 TO 45', '49 TO 51'],
      dtype=object)
In [12]:
prices['town'].unique()
Out[12]:
array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG', 'CLEMENTI',
       'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'QUEENSTOWN', 'SENGKANG',
       'SERANGOON', 'TAMPINES', 'TOA PAYOH', 'WOODLANDS', 'YISHUN',
       'LIM CHU KANG', 'SEMBAWANG', 'BUKIT PANJANG', 'PASIR RIS',
       'PUNGGOL'], dtype=object)
In [13]:
plt.hist(prices['floor_area_sqm'], bins=50, edgecolor='black')
plt.title('Distribution of HDB Floor Area')
plt.show()
display(prices[prices['floor_area_sqm'] > 200]['flat_model'].value_counts())
Terrace                 61
Maisonette              14
Executive Maisonette     7
Apartment                4
Adjoined flat            1
Name: flat_model, dtype: int64

The floor area outliers mostly belong to special HDB types that are larger than standard flats, so they may not be outliers from a multivariate perspective.
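One way to make this concrete is to flag outliers within each flat model rather than globally. A minimal sketch with made-up floor areas (the column names match the notebook, but the data and the 1.5×IQR rule are illustrative):

```python
import pandas as pd

# Toy data: Terrace flats are huge compared to the overall distribution,
# but typical within their own model
df = pd.DataFrame({
    'flat_model': ['Improved'] * 12 + ['Terrace'] * 2,
    'floor_area_sqm': [60, 62, 64, 65, 66, 68, 68, 70, 70, 72, 74, 75,
                       240, 250],
})

def iqr_outliers(s):
    """Flag values outside 1.5 * IQR of the series' own distribution."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

global_flags = iqr_outliers(df['floor_area_sqm'])  # univariate check
group_flags = df.groupby('flat_model')['floor_area_sqm'].transform(iqr_outliers)

print(global_flags.sum(), group_flags.sum())  # → 2 0
```

The two Terrace rows are flagged by the global check but not by the per-model one, which is the sense in which the large flats above are not multivariate outliers.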

In [14]:
bins = prices['lease_commence_date'].max() - prices['lease_commence_date'].min()
plt.hist(prices['lease_commence_date'], bins=bins, edgecolor='black')
plt.title('Distribution of Lease Commence Year')
plt.show()

2.1. Inflation Adjustment Using CPI

In [15]:
# Compute Resale Price Adjusted for Inflation Using Consumer Price Index for Housing & Utilities
# https://www.singstat.gov.sg/find-data/search-by-theme/economy/prices-and-price-indices/latest-data
cpi['month'] = pd.to_datetime(cpi['month'], format='%Y %b') # to datetime
prices = prices.merge(cpi, on='month', how='left') 
# https://people.duke.edu/~rnau/411infla.htm
prices['real_price'] = (prices['resale_price'] / prices['cpi']) * 100 
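The deflation in the cell above is a simple rebasing: dividing each sale price by the CPI for its month and multiplying by 100 expresses all prices in base-year dollars, assuming the CPI series is indexed to 100 in the base year. A hand-checkable sketch (the CPI values are made up):

```python
# Deflating to base-period dollars: price / CPI * 100
def to_real(nominal_price, cpi_at_sale):
    return nominal_price / cpi_at_sale * 100

# S$100,000 at CPI 50 has the same purchasing power as S$200,000 at CPI 100
print(to_real(100_000, 50))   # → 200000.0
print(to_real(200_000, 100))  # → 200000.0
```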
In [16]:
# Plot Median Resale Prices Over the Years

# Unadjusted
fig = plt.figure(figsize=(14,4.5))
fig.suptitle('Median HDB Resale Prices Over the Years', fontsize=18)
ax1 = fig.add_subplot(121)
prices.groupby('month')[['resale_price']].median().plot(ax=ax1, color='#00cef6', legend=None)
ax1.set_xlabel('Date'), ax1.set_ylabel('Resale Price in SGD ($)'), ax1.set_ylim(0, 500000), ax1.set_title('Unadjusted for Inflation', size=15)

# Adjusted
# https://jakevdp.github.io/PythonDataScienceHandbook/04.09-text-and-annotation.html
ax2 = fig.add_subplot(122)
prices.groupby('month')[['real_price']].median().plot(ax=ax2, color='#3c78d8', legend=None)
ax2.set_xlabel('Date'), ax2.set_ylabel('Resale Price in SGD ($)'), ax2.set_ylim(0, 500000), ax2.set_title('Adjusted for Inflation to 2019 SGD',size=15)
ax2.annotate('1997 Asian Financial Crisis\nMedian: $403,766', xy=('1997-05-01',380000), xycoords='data', 
    bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="#28324a"), xytext=(50,-140), textcoords='offset points', ha='center',
    arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
ax2.annotate('2013 Cooling Measures\nMedian: $401,887', xy=('2013-07-01',380000), xycoords='data', 
    bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="#28324a"), xytext=(0,-90), textcoords='offset points', ha='center',
    arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))  
plt.tight_layout(rect=[0, 0, 0.9, 0.9]) 
# for ax, color in zip([ax1, ax2], ['#3c78d8', '#3c78d8']):
#     plt.setp(tuple(ax.spines.values()), color=color)
#     plt.setp([ax.get_xticklines(), ax.get_yticklines()], color=color)
plt.show()
#prices.set_index('month').loc['1997']['real_price'].median()

Following the collapse of the Thai Baht in July 1997, housing prices in Singapore continued to fall and only started gradually rising again around 2004. In 2013, prices declined again due to the property market cooling measures, such as the Additional Buyer's Stamp Duty (ABSD), the Loan-to-Value (LTV) Ratio, and the Total Debt Servicing Ratio (TDSR). Refer here for more information.

In [17]:
# Convert remaining_lease to number of years
def getYears(text):
    if isinstance(text, str):
        yearmonth = [int(s) for s in text.split() if s.isdigit()]
        if len(yearmonth) > 1: # if there's year and month
            years = yearmonth[0] + (yearmonth[1]/12)
        else: # if only year
            years = yearmonth[0]
        return years
    else: # if int
        return text

prices['remaining_lease'] = prices['remaining_lease'].apply(getYears)
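For reference, the post-2015 files store remaining_lease either as a plain year count or as a "years and months" string, and the parser maps both to fractional years. A quick sanity check (the function is repeated so the snippet runs on its own):

```python
def getYears(text):
    if isinstance(text, str):
        yearmonth = [int(s) for s in text.split() if s.isdigit()]
        if len(yearmonth) > 1:              # e.g. "61 years 04 months"
            return yearmonth[0] + yearmonth[1] / 12
        return yearmonth[0]                 # e.g. "70"
    return text                             # already numeric (or NaN)

print(round(getYears('61 years 04 months'), 2))  # → 61.33
print(getYears('70'))                            # → 70
print(getYears(65))                              # → 65
```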
In [18]:
prices.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 826581 entries, 0 to 826580
Data columns (total 13 columns):
month                  826581 non-null datetime64[ns]
town                   826581 non-null object
flat_type              826581 non-null object
block                  826581 non-null object
street_name            826581 non-null object
storey_range           826581 non-null object
floor_area_sqm         826581 non-null float64
flat_model             826581 non-null object
lease_commence_date    826581 non-null int64
resale_price           826581 non-null float64
remaining_lease        117527 non-null float64
cpi                    826581 non-null float64
real_price             826581 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1), object(6)
memory usage: 88.3+ MB
In [19]:
bins = prices['remaining_lease'].max() - prices['remaining_lease'].min()
plt.hist(prices['remaining_lease'], bins=int(bins), edgecolor='black')
plt.title('Distribution of Remaining Lease for 2016-2020 Data')
plt.show()
In [20]:
prices.head()
Out[20]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease cpi real_price
0 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 Improved 1977 9000.0 NaN 60.894 14779.781259
1 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 Improved 1977 6000.0 NaN 60.894 9853.187506
2 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 Improved 1977 8000.0 NaN 60.894 13137.583342
3 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 Improved 1977 6000.0 NaN 60.894 9853.187506
4 1990-01-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 New Generation 1976 47200.0 NaN 60.894 77511.741715

3. Exploratory Data Analysis

3.1. By Flat Type

In [30]:
## Waffle chart for flat type - number of rooms
from pywaffle import Waffle

flattype = dict(prices['flat_type'].value_counts() / len(prices) * 100)
tmp1519 = prices.set_index('month')['2015':'2019']
flattype1519 = dict(tmp1519['flat_type'].value_counts() / len(tmp1519) * 100)

plt.figure(figsize=(10,5),
    FigureClass=Waffle, 
    plots={
        '211': {
            'values': flattype,
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
            'title': {'label': 'Proportion of HDB Flat Types (All Years)', 'loc': 'left', 'fontsize':16}
        },
        '212': {
            'values': flattype1519,
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
            'title': {'label': 'Proportion of HDB Flat Types (2015-2019)', 'loc': 'left', 'fontsize':16}            
        },
    },
    rows=5, 
    colors=["#1f77b4", "#ff7f0e", "green", 'purple', 'black', 'yellow', 'brown'],
    #colors=["#3c78d8", "#00cef6", "#aff000", '#28324a', 'black', 'yellow', 'brown'],
    icons='home', 
    font_size=22, 
    icon_legend=True)
    
plt.show()

There are not many 1 Room, 2 Room and Multi Generation flats, so they will be excluded from the flat-type analysis.

In [23]:
flattype = ['3 ROOM','4 ROOM','5 ROOM','EXECUTIVE']
prices1519 = prices.set_index('month').sort_index().loc['2015-01':'2019-12']
prices1519 = prices1519[prices1519['flat_type'].isin(flattype)][['flat_type','real_price']].reset_index()
prices1519['flat_type_year'] = prices1519['flat_type'] + ' - ' + prices1519['month'].apply(lambda x: str(x)[:4])
prices1519
Out[23]:
month flat_type real_price flat_type_year
0 2015-01-01 3 ROOM 227794.502559 3 ROOM - 2015
1 2015-01-01 3 ROOM 245660.738054 3 ROOM - 2015
2 2015-01-01 3 ROOM 254593.855802 3 ROOM - 2015
3 2015-01-01 3 ROOM 259060.414675 3 ROOM - 2015
4 2015-01-01 3 ROOM 259060.414675 3 ROOM - 2015
... ... ... ... ...
100056 2019-12-01 EXECUTIVE 577372.953064 EXECUTIVE - 2019
100057 2019-12-01 EXECUTIVE 562440.893933 EXECUTIVE - 2019
100058 2019-12-01 EXECUTIVE 630132.895326 EXECUTIVE - 2019
100059 2019-12-01 EXECUTIVE 785314.817580 EXECUTIVE - 2019
100060 2019-12-01 EXECUTIVE 615200.836195 EXECUTIVE - 2019

100061 rows × 4 columns

In [24]:
# ridgeline plot for looking at distribution of flat types by year
import joypy
fig, axes = joypy.joyplot(prices1519, by="flat_type_year", column="real_price",figsize=(9,7),
             linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',4))
axes[-1].set_xlim([0,1200000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k']) 
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()

We don't see much change within each flat type over the last 5 years. The only consistent pattern is that prices rise with the number of rooms, which is unsurprising.

3.2. By Town

In [23]:
## 2015 to 2019
prices['year'] = pd.DatetimeIndex(prices['month']).year # extract out year
prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby(['town'], as_index=False).agg({'real_price': 'median'}).sort_values('real_price', ascending=True).reset_index(drop=True)
prices1519['real_price'] = round(prices1519['real_price']/1000)
prices1519['color'] = ['#f8766d'] + ['#3c78d8']*(len(prices1519)-2) + ['#00ba38']

# 4-room
prices1519_4room = prices[(prices['flat_type'].isin(['4 ROOM'])) & (prices['year'].isin([2015,2016,2017,2018,2019]))].groupby(['town'], as_index=False).agg({'real_price': 'median'}).sort_values('real_price', ascending=True).reset_index(drop=True)
prices1519_4room['real_price'] = round(prices1519_4room['real_price']/1000)
prices1519_4room['color'] = ['#f8766d','#f8766d'] + ['#3c78d8']*(len(prices1519_4room)-3) + ['#00ba38']

## 1997 vs 2019
# all room type
prices9719 = prices[prices['year'].isin([1997,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719['change'] = prices9719.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719 = prices9719[prices9719['change'].notnull()] 
prices9719 = prices9719.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719['color'] = prices9719['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

# 4-room
prices9719_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([1997,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719_4room['change'] = prices9719_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719_4room = prices9719_4room[prices9719_4room.change.notnull()]
prices9719_4room = prices9719_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719_4room['color'] = prices9719_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

## 2018 vs 2019
# all room type
prices1819 = prices[prices['year'].isin([2018,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819['change'] = prices1819.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819 = prices1819[prices1819['change'].notnull()] 
prices1819 = prices1819.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819['color'] = prices1819['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

# 4-room
prices1819_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([2018,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819_4room['change'] = prices1819_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819_4room = prices1819_4room[prices1819_4room.change.notnull()]
prices1819_4room = prices1819_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819_4room['color'] = prices1819_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
In [80]:
# Function for lollipop charts
def loll_plot(df, x, y, subtitle, xlabel, xlim):
    plt.rc('axes', axisbelow=True)
    plt.grid(linestyle='--', alpha=0.4)
    plt.hlines(y=df.index, xmin=0, xmax=df[x], color=df.color, linewidth=1)
    plt.scatter(df[x], df.index, color=df.color, s=300)
    for i, txt in enumerate(df[x]):
        plt.annotate(str(round(txt)), (txt, i), color='white', fontsize=9, ha='center', va='center')
    plt.annotate(subtitle, xy=(1, 0), xycoords='axes fraction', fontsize=20,
                    xytext=(-5, 5), textcoords='offset points',
                    ha='right', va='bottom')
    plt.yticks(df.index, df[y]); plt.xticks(fontsize=12); plt.xlim(xlim)
    plt.xlabel(xlabel, fontsize=14)
In [81]:
fig = plt.figure(figsize=(12,7))

ax1 = plt.subplot(121)
loll_plot(prices1519, 'real_price', 'town', 'All Room Types', 'Resale Price (SGD)', [50,800])
ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()])
ax1.yaxis.set_ticks_position('none') 

ax2 = plt.subplot(122)
loll_plot(prices1519_4room, 'real_price', 'town', '4-Room', 'Resale Price (SGD)', [50,800])
ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()])
ax2.yaxis.set_ticks_position('none') 

fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2015 to 2019, Median Price of Flats', fontsize=16)
plt.show()
In [27]:
fig = plt.figure(figsize=(12,7))

ax1 = plt.subplot(121)
loll_plot(prices9719, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-40,125])

ax2 = plt.subplot(122)
loll_plot(prices9719_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-40,125])

fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('1997 vs 2019, Median Price of Flats', fontsize=16)
plt.show()

Queenstown shows one of the largest increases in resale price, which could be due to the extensive development it has undergone over the past two decades.

In [28]:
fig = plt.figure(figsize=(12,7))

ax1 = plt.subplot(121)
loll_plot(prices1819, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-30,20])

ax2 = plt.subplot(122)
loll_plot(prices1819_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-30,20])

fig.tight_layout(pad=0.5, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2018 vs 2019, Median Price of Flats', fontsize=16)
plt.show()

The changes from 2018 to 2019 are not very large, although prices in Toa Payoh have dropped, and Central Area 4-room flats have dropped by about 20%. Could it be because these areas have older HDB flats, meaning their leases are now shorter? As shown below, towns like Punggol and Sengkang, which were developed later, tend to have later lease commence dates, which might explain their slight price increases, while towns like Toa Payoh and the Central Area tend to have earlier lease commence dates.

In [85]:
prices[prices['year'].isin([2018,2019])].groupby('town')['lease_commence_date'].median().sort_values()
Out[85]:
town
MARINE PARADE      1975
ANG MO KIO         1980
BEDOK              1980
CLEMENTI           1980
KALLANG/WHAMPOA    1982
GEYLANG            1982
TOA PAYOH          1984
JURONG EAST        1984
CENTRAL AREA       1984
SERANGOON          1986
BUKIT MERAH        1986
BUKIT BATOK        1986
YISHUN             1988
BUKIT TIMAH        1988
BISHAN             1988
TAMPINES           1988
HOUGANG            1989
PASIR RIS          1993
QUEENSTOWN         1995
CHOA CHU KANG      1996
JURONG WEST        1997
WOODLANDS          1997
BUKIT PANJANG      1999
SEMBAWANG          2001
SENGKANG           2006
PUNGGOL            2013
Name: lease_commence_date, dtype: int64

3.3. By Storeys

In [29]:
fig = plt.figure(figsize=(12,4))

# Storey Prices
ax1 = plt.subplot(121)
storey = prices.groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey['storey_rank'] = storey['storey_range'].astype('category').cat.codes # label encode
a=sns.scatterplot(x=storey['storey_rank'], y=storey['real_price'], s=storey['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax1)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_xticklabels(pd.Series(['']).append(storey.iloc[[0,5,10,15,20,24],0]))
ax1.set_ylim([280000,1100000]), ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Storey', size=15)
ax1.set_title('All Years', size=15)

# Floor Area Prices
ax2 = plt.subplot(122)
storey2 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey2['storey_rank'] = storey2['storey_range'].astype('category').cat.codes

# Bubble chart
b=sns.scatterplot(x=storey2['storey_rank'], y=storey2['real_price'], s=storey2['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax2)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000]
ax2.set_yticklabels(ylabels); ax2.set_ylabel('')
ax2.set_xticks([0,4,8,12,16])
ax2.set_xticklabels(storey2.iloc[[0,4,8,12,16],0])
ax2.set_ylim([280000,1100000]), ax2.set_xlabel('Storey', size=15)
ax2.set_title('2015 to 2019', size=15)

plt.show()

Surprisingly, this follows a linear relationship, with higher storeys being sold at a higher price.

3.4. By Floor Area

In [30]:
# Floor Area Prices
area = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
p=sns.regplot(x='floor_area_sqm', y='real_price', data=area, scatter_kws={"s": 1, 'alpha':0.5})
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
p.set_xlabel('Floor Area (Square Meters)', size=15)
plt.close()

In [31]:
display(area[area['floor_area_sqm'] > 200])
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease cpi real_price year
712167 2015-03-01 KALLANG/WHAMPOA 3 ROOM 53 JLN MA'MOR 01 TO 03 280.0 Terrace 1972 1060000.0 56.000000 111.472 9.509114e+05 2015
716959 2015-06-01 KALLANG/WHAMPOA 3 ROOM 60 JLN BAHAGIA 01 TO 03 241.0 Terrace 1972 958000.0 56.000000 109.812 8.724001e+05 2015
745544 2016-12-01 KALLANG/WHAMPOA 3 ROOM 57 JLN MA'MOR 01 TO 03 259.0 Terrace 1972 1150000.0 54.000000 103.855 1.107313e+06 2016
755075 2017-06-01 KALLANG/WHAMPOA 3 ROOM 38 JLN BAHAGIA 01 TO 03 215.0 Terrace 1972 830000.0 54.083333 103.097 8.050671e+05 2017
760156 2017-09-01 CHOA CHU KANG EXECUTIVE 641 CHOA CHU KANG ST 64 16 TO 18 215.0 Executive Maisonette 1998 888000.0 79.333333 102.393 8.672468e+05 2017
765900 2017-12-01 KALLANG/WHAMPOA 3 ROOM 65 JLN MA'MOR 01 TO 03 249.0 Terrace 1972 1053888.0 53.583333 101.479 1.038528e+06 2017
767035 2018-01-01 CHOA CHU KANG EXECUTIVE 639 CHOA CHU KANG ST 64 10 TO 12 215.0 Executive Maisonette 1998 900000.0 79.000000 100.274 8.975407e+05 2018
781983 2018-09-01 KALLANG/WHAMPOA 3 ROOM 41 JLN BAHAGIA 01 TO 03 237.0 Terrace 1972 1185000.0 52.833333 101.889 1.163030e+06 2018

The cases in the top right of the chart consist of flats that are either Terrace or Executive Maisonette models, which is unsurprising.

3.5. By Block Number

The 3-digit block numbering system was introduced in the 1970s, with the first digit representing a neighbourhood within a town. For example, Ang Mo Kio (AMK) neighbourhood 1 blocks start at 101 and neighbourhood 2 blocks start at 201. The first digit was therefore separated from the last 2 digits and plotted separately.

In [32]:
import re

# Block Number Prices
get_num = lambda x: int(re.findall(r"\d+", x)[0])
prices['blocknum'] = prices['block'].apply(get_num) # get only digits from block number
tmp = prices[prices['blocknum'] > 99] # get only blocks that use 3-digit numbering system
tmp = tmp.groupby('blocknum')['real_price'].median().reset_index()

# Scatterplots
fig = plt.figure(figsize=(12,4))

ax1 = plt.subplot(121)
a=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[0])), y=tmp['real_price'], color='#ff9933', edgecolors='w', alpha=0.9)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Neighbourhood Number', size=15)

ax2 = plt.subplot(122)
b=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[1:])), y=tmp['real_price'], edgecolors='w', alpha=0.9)
ax2.set_yticklabels(ylabels)
ax2.set_ylabel('', size=15)
ax2.set_xlabel('Block Number', size=15)

plt.show()

Block number doesn't seem to influence prices.

3.6. By Flat Model

In [33]:
# Violin plots for price distribution of each flat model

fig = plt.figure(figsize=(12,7))
p=sns.violinplot(x='flat_model', y='real_price', data=prices, width=1,
                order=prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist())
p.set_xticklabels(p.get_xticklabels(), rotation=30, ha='right'), p.set_xlabel('Flat Models', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
plt.show()

The special models like Type S1S2 (The Pinnacle@Duxton) and Terrace tend to fetch higher prices, while the older models from the 1960s-70s tend to go for less.

In [34]:
# ridgeline plot

# tmp = prices.set_index('flat_model')
# tmp = tmp.loc[prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist()].reset_index().groupby("flat_model", sort=False)
# fig, axes = joypy.joyplot(tmp, by="flat_model", column="real_price",figsize=(12,5),
#              linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
# axes[-1].set_xlim([-50000,1400000])
# axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k']) 
# plt.xlabel('Resale Price SGD ($)', fontsize=14)
# fig.show()

3.7. By Lease Commence Date

In [35]:
# Boxplot for each year of lease commence date

fig = plt.figure(figsize=(7,9))
p=sns.boxplot(y='lease_commence_date', x='real_price', data=prices, width=1, orient='h', flierprops = dict(markerfacecolor = 'red', markersize = 0.1, linestyle='none'), linewidth=0.4)
p.set_xlabel('Resale Price SGD ($)', size=15), p.set_ylabel('Lease Commence Year', size=15)
xlabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_xticks()/1000]
p.set_xticklabels(xlabels)
p.set_title('Resale Price By Lease Commence Year', size=15)
plt.show()
In [36]:
tmp = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
fig, axes = joypy.joyplot(tmp, by="lease_commence_date", column="real_price",figsize=(6,10),
             linewidth=1,overlap=5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
axes[-1].set_xlim([-50000,1400000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k']) 
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()

3.8. By Distance to Nearest Amenities

The names of schools, supermarkets, hawker centres, shopping malls, parks and MRT stations were downloaded/scraped from Data.gov.sg and Wikipedia and fed through a function that uses the OneMap.sg API to obtain their coordinates (latitude and longitude). These coordinates were then fed through functions that use the geopy package to compute distances between locations. From these, the distance from each flat to the nearest amenity of each type can be computed, as well as the number of each amenity within a 2 km radius of each flat.

The script for this can be found here.

Some flats in Kallang/Whampoa have no distance values due to search errors on OneMap.sg.
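The distance computation can be sketched with a standard-library haversine function. This is only an illustration: the original script uses the geopy package (which is slightly more accurate at these scales), and `nearest_amenity` is a hypothetical helper, not a function from that script.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def nearest_amenity(flat, amenities):
    """Return (distance to nearest amenity in km, count of amenities within 2 km)."""
    dists = [haversine_km(flat[0], flat[1], a[0], a[1]) for a in amenities]
    return min(dists), sum(d <= 2 for d in dists)
```

Applied to every (flat, amenity-type) pair, this yields both the `*_dist` and the `num_*_2km` features used below.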

In [24]:
flat_amenities = pd.read_csv('Data/flat_amenities.csv')

# merge amenities data to flat data
prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].copy()
prices1519['flat'] = prices1519['block'] + ' ' + prices1519['street_name']
prices1519 = prices1519.merge(flat_amenities, on='flat', how='left')

# reduce number of class of town to regions
d_region = {'ANG MO KIO':'North East', 'BEDOK':'East', 'BISHAN':'Central', 'BUKIT BATOK':'West', 'BUKIT MERAH':'Central',
       'BUKIT PANJANG':'West', 'BUKIT TIMAH':'Central', 'CENTRAL AREA':'Central', 'CHOA CHU KANG':'West',
       'CLEMENTI':'West', 'GEYLANG':'Central', 'HOUGANG':'North East', 'JURONG EAST':'West', 'JURONG WEST':'West',
       'KALLANG/WHAMPOA':'Central', 'MARINE PARADE':'Central', 'PASIR RIS':'East', 'PUNGGOL':'North East',
       'QUEENSTOWN':'Central', 'SEMBAWANG':'North', 'SENGKANG':'North East', 'SERANGOON':'North East', 'TAMPINES':'East',
       'TOA PAYOH':'Central', 'WOODLANDS':'North', 'YISHUN':'North'}
prices1519['region'] = prices1519['town'].map(d_region)
colors = {'North East':'Purple', 'East':'Green', 'Central':'Brown', 'West':'Red', 'North':'Orange'}
In [83]:
# get median info of each town
tmp = prices1519.groupby('town')[['dist_dhoby','school_dist','num_school_2km','hawker_dist','num_hawker_2km','park_dist','num_park_2km','mall_dist','num_mall_2km','mrt_dist','num_mrt_2km','supermarket_dist','num_supermarket_2km','real_price']].median().reset_index()
tmp['region'] = tmp['town'].map(d_region)

# Scatterplot with names of towns
fig, ax = plt.subplots(figsize=(8,5))
grouped = tmp.groupby('region')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='dist_dhoby', y='real_price', label=key, color=colors[key], s=60)
b, a = np.polyfit(tmp['dist_dhoby'], tmp['real_price'], 1)
ax.plot(tmp['dist_dhoby'], a + b* tmp['dist_dhoby'], '-')  
ax.set_xlim([0,20]), ax.set_xlabel('Distance from Dhoby Ghaut MRT (km)', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000]
ax.set_yticklabels(ylabels), ax.set_ylabel('Resale Price SGD ($)', size=15)
for i, txt in enumerate(tmp['town']):
    ax.annotate(txt, (tmp['dist_dhoby'][i]+0.3, tmp['real_price'][i]), size=8, alpha=0.9)

plt.show()
In [86]:
prices1519.groupby('region')['real_price'].median()
Out[86]:
region
Central       485659.048669
East          413630.881781
North         351797.804703
North East    410850.094496
West          371119.119471
Name: real_price, dtype: float64

The relationship is negative: flats further away from Dhoby Ghaut MRT (Central) tend to have lower resale prices.

In [39]:
# scatterplot for median price of each town against nearest distance from each amenity

p=sns.pairplot(tmp, x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Distance from Nearest Amenities (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Marine Parade is far from the nearest MRT station.

In [40]:
# scatterplot for price of each flat against nearest distance from each amenity

p=sns.pairplot(prices1519[prices1519['school_dist']<3], x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e'))) # remove outliers (>3km)
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price VS Distance from Nearest Amenities')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Surprisingly, the relationship with distance to school is weaker than expected, while the relationship with distance to hawker centres is more pronounced than in the town-median plots.

3.9. By Number of Amenities in 2km Radius

In [41]:
# scatterplot for median price of each town against number of amenities

p=sns.pairplot(tmp, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Number of Amenities in 2km Radius (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
In [42]:
# scatterplot for price of each flat against number of amenities

p=sns.pairplot(prices1519, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e')))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price VS Number of Amenities in 2km Radius')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Punggol has its own LRT and Central Area has lots of stations and malls nearby.

4. Data Preparation

From this point on, we focus only on data from 2015 to 2019.

In [43]:
# clear unused variables
del(price1999, price2012, price2014, price2016, price2017, prices1819, prices1819_4room, prices9719, prices9719_4room,
    storey, storey2, tmp, xlabels, ylabels, p, grouped, flattype2019, flattype, flat_amenities, cpi, ax, ax1, ax2, area)
import gc
gc.collect()
Out[43]:
235467

4.1. Missing Values

Replace missing distance values with median of the town. Only Kallang/Whampoa has missing data, so the function below will replace them with the median of the Kallang/Whampoa distance variables.

In [85]:
df = prices1519[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date', 'year', 'school_dist', 'num_school_2km', 'hawker_dist', 'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist', 'num_supermarket_2km', 'dist_dhoby', 'region', 'real_price']]

# function for replacing NAs with median of the town
def replace_NA_median(df, columns):
    for c in columns:      
        df[c] = df.groupby("town")[c].transform(lambda x: x.fillna(x.median()))
    return df

df = replace_NA_median(df, ['school_dist', 'num_school_2km', 'hawker_dist',
       'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist',
       'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist',
       'num_supermarket_2km', 'dist_dhoby'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 101409 entries, 0 to 101408
Data columns (total 22 columns):
town                   101409 non-null object
flat_type              101409 non-null object
storey_range           101409 non-null object
floor_area_sqm         101409 non-null float64
flat_model             101409 non-null object
lease_commence_date    101409 non-null int64
year                   101409 non-null int64
school_dist            101409 non-null float64
num_school_2km         101409 non-null float64
hawker_dist            101409 non-null float64
num_hawker_2km         101409 non-null float64
park_dist              101409 non-null float64
num_park_2km           101409 non-null float64
mall_dist              101409 non-null float64
num_mall_2km           101409 non-null float64
mrt_dist               101409 non-null float64
num_mrt_2km            101409 non-null float64
supermarket_dist       101409 non-null float64
num_supermarket_2km    101409 non-null float64
dist_dhoby             101409 non-null float64
region                 101409 non-null object
real_price             101409 non-null float64
dtypes: float64(15), int64(2), object(5)
memory usage: 17.8+ MB

4.2. Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. Multicollinearity can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable.

If our goal is classification or prediction, and not examining the importance or contribution of each feature, we do not need to deal with multicollinearity, as it does not affect the overall prediction or goodness of fit; it only affects the p-values and the coefficients. But since we are interested in feature importance judged from the coefficients of the linear regression model, we have to address it.

  • If the largest VIF is greater than 10 then there is cause for concern (Bowerman & O’Connell, 1990; Myers, 1990)
  • If the average VIF is substantially greater than 1 then the regression may be biased (Bowerman & O’Connell, 1990).
  • Tolerance below 0.1 indicates a serious problem.
  • Tolerance below 0.2 indicates a potential problem (Menard, 1995).
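By definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the other features, and tolerance is its reciprocal. The cells below use statsmodels' `variance_inflation_factor`; as a sanity check, the definition can be sketched directly in NumPy:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing that column on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out
```

Independent columns give VIFs near 1; near-collinear columns give large VIFs.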
In [45]:
# Correlation heatmap
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(df.select_dtypes(include=['int64','float64']).corr(), annot = True, fmt='.2g', 
    vmin=-1, vmax=1, center= 0, cmap= 'coolwarm_r', linecolor='black', linewidth=1, annot_kws={"size": 7})
#ax.set_ylim(0 ,5)
plt.xticks(rotation=45, ha='right')
plt.title('Correlation Heatmap')
fig.show()
In [46]:
# Multicollinearity
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['tolerance'] = 1/vif.VIF
    vif['meanVIF'] = vif.VIF.mean()

    return(vif)

calc_vif(df.drop('real_price',axis=1).select_dtypes(include=['int64','float64']))
Out[46]:
variables VIF tolerance meanVIF
0 floor_area_sqm 19.894200 0.050266 5548.765039
1 lease_commence_date 44562.410995 0.000022 5548.765039
2 year 44064.404189 0.000023 5548.765039
3 school_dist 4.706948 0.212452 5548.765039
4 num_school_2km 15.415693 0.064869 5548.765039
5 hawker_dist 10.903288 0.091715 5548.765039
6 num_hawker_2km 5.347933 0.186988 5548.765039
7 park_dist 6.314078 0.158376 5548.765039
8 num_park_2km 5.175563 0.193216 5548.765039
9 mall_dist 6.065183 0.164875 5548.765039
10 num_mall_2km 6.806727 0.146913 5548.765039
11 mrt_dist 5.348409 0.186972 5548.765039
12 num_mrt_2km 10.385980 0.096284 5548.765039
13 supermarket_dist 4.711751 0.212235 5548.765039
14 num_supermarket_2km 27.097202 0.036904 5548.765039
15 dist_dhoby 25.252494 0.039600 5548.765039
In [47]:
calc_vif(df.drop(['real_price','num_supermarket_2km','year','num_school_2km','dist_dhoby'],axis=1).select_dtypes(include=['int64','float64']))
Out[47]:
variables VIF tolerance meanVIF
0 floor_area_sqm 19.454717 0.051401 10.562684
1 lease_commence_date 53.075649 0.018841 10.562684
2 school_dist 4.350898 0.229838 10.562684
3 hawker_dist 8.311330 0.120318 10.562684
4 num_hawker_2km 4.163071 0.240207 10.562684
5 park_dist 6.172912 0.161998 10.562684
6 num_park_2km 4.309959 0.232021 10.562684
7 mall_dist 5.474837 0.182654 10.562684
8 num_mall_2km 5.810105 0.172114 10.562684
9 mrt_dist 5.312813 0.188224 10.562684
10 num_mrt_2km 5.916204 0.169027 10.562684
11 supermarket_dist 4.399711 0.227288 10.562684
In [86]:
# drop columns
lr_df = df.drop(['num_supermarket_2km','year','num_school_2km','dist_dhoby'], axis=1)

There is definitely multicollinearity among our continuous features. Even after removing several features, the highest VIF is still around 50 (lease_commence_date), and floor_area_sqm is above 10 as well. I chose to leave them in as I believe they are important contributors to resale prices. An alternative is to refit another model without them and compare the output.

4.3. Normality

In [49]:
# Plot distribution for each continuous variable
lr_df.hist(bins=50, figsize=(15, 10), grid=False, edgecolor='black')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Not all the variables follow a normal distribution, and most of the distance variables have some outliers. For these outliers, it is better to use a multivariate approach such as the Mahalanobis distance rather than a univariate approach. If needed, some of these variables can be transformed to reduce skewness.
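For reference, the Mahalanobis distance mentioned above measures how far each row sits from the multivariate centre, scaled by the covariance of the features. A minimal NumPy sketch (squared distances; outliers would then be flagged against a chi-square cutoff with degrees of freedom equal to the number of features):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the column means."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # quadratic form diff_i' * inv_cov * diff_i for every row i
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
```

As noted in section 4.6, computing this for the full dataset was too memory-hungry here, so Cook's distance is used for outlier detection instead.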

In [50]:
# plot qqplot before and after log transformation

from statsmodels.api import qqplot
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2,figsize=(5,5))

ax1.hist(lr_df['real_price'], bins=50, edgecolor='black')
qqplot(lr_df['real_price'], line='s', ax=ax2)
ax3.hist(np.log(lr_df['real_price']), bins=50, edgecolor='black')
qqplot(np.log(lr_df['real_price']), line='s', ax=ax4)
plt.suptitle('Real Price Normality Before (Top) & After (Bottom) Logging')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

We will fit the linear regression first and come back to check on the normality of the residuals as well as homoscedasticity.

4.4. Label & Dummy Encoding

In [27]:
# Frequency plots for categorical features
fig = plt.figure(figsize=(30,5))
for count, col in enumerate(df.select_dtypes(include=['category','object']).columns):
    fig.add_subplot(1,5,count+1)
    df[col].value_counts().plot.barh()
    plt.title(col)
    plt.yticks(rotation=45)
    plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])

Region might be a better choice than town, since town has many classes and one-hot encoding it would produce a very sparse matrix. For flat_type, we can remove Multi Generation and 1 Room since there are few instances of them. storey_range will be label encoded according to its levels, while flat_model should be grouped further to reduce the number of classes.

In [87]:
# label encode storeys
df = df.sort_values(by='storey_range')
df['storey_range'] = df['storey_range'].astype('category').cat.codes # label encode
lr_df = lr_df.sort_values(by='storey_range')
lr_df['storey_range'] = lr_df['storey_range'].astype('category').cat.codes # label encode

# remove flat types with very few cases
df = df[~df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
lr_df = lr_df[~lr_df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]

# Re-categorize flat model to reduce num classes
replace_values = {'Executive Maisonette':'Maisonette', 'Terrace':'Special', 'Adjoined flat':'Special', 
                    'Type S1S2':'Special', 'DBSS':'Special', 'Model A2':'Model A', 'Premium Apartment':'Apartment', 'Improved':'Standard', 'Simplified':'Model A', '2-room':'Standard'}
df = df.replace({'flat_model': replace_values})
lr_df = lr_df.replace({'flat_model': replace_values})

# Label encode flat type
replace_values = {'2 ROOM':0, '3 ROOM':1, '4 ROOM':2, '5 ROOM':3, 'EXECUTIVE':4}
df = df.replace({'flat_type': replace_values})
lr_df = lr_df.replace({'flat_type': replace_values})

df = df.reset_index(drop=True)
display(df['flat_model'].value_counts())
lr_df = lr_df.reset_index(drop=True)
display(lr_df['flat_model'].value_counts())
Model A           37973
Standard          28539
New Generation    15389
Apartment         14532
Maisonette         3151
Special            1744
Name: flat_model, dtype: int64
Model A           37973
Standard          28539
New Generation    15389
Apartment         14532
Maisonette         3151
Special            1744
Name: flat_model, dtype: int64

Storey_range and flat_type were label encoded since they are ordinal.
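Note that `.cat.codes` numbers categories in sorted order, which is why the zero-padded storey ranges ('01 TO 03', '04 TO 06', …) come out in the correct ordinal order. A small illustration:

```python
import pandas as pd

s = pd.Series(['07 TO 09', '01 TO 03', '13 TO 15', '04 TO 06'])
codes = s.astype('category').cat.codes
# Categories sort lexicographically; zero-padding keeps that order numeric.
print(list(codes))  # [2, 0, 3, 1]
```

Without the zero-padding (e.g. '4 TO 6'), lexicographic sorting would scramble the ordinal encoding.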

In [88]:
display(lr_df.head())
town flat_type storey_range floor_area_sqm flat_model lease_commence_date school_dist hawker_dist num_hawker_2km park_dist num_park_2km mall_dist num_mall_2km mrt_dist num_mrt_2km supermarket_dist region real_price
0 TAMPINES 2 0 108.0 Model A 1985 0.359916 1.015992 1.0 0.591787 8.0 0.222197 6.0 0.265708 5.0 0.197700 East 470763.371323
1 BUKIT BATOK 2 0 103.0 Model A 1986 0.428729 2.476292 0.0 0.976805 10.0 1.395023 4.0 0.542323 6.0 0.334088 West 368625.028643
2 JURONG WEST 1 0 74.0 Model A 1984 0.174088 0.441678 4.0 0.725144 4.0 0.907439 3.0 0.229473 3.0 0.467091 West 253396.280952
3 JURONG WEST 1 0 74.0 Model A 1986 0.409651 0.095477 3.0 1.147330 4.0 0.772041 3.0 0.753994 3.0 0.194830 West 263833.992095
4 JURONG WEST 1 0 74.0 Model A 1985 0.437646 0.522134 4.0 1.329356 3.0 1.276048 2.0 0.806693 2.0 0.301456 West 261857.707510
In [89]:
## dummy encoding
df = pd.get_dummies(df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
df = pd.get_dummies(df, columns=['flat_model'], prefix=['model'])
df= df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
lr_df = pd.get_dummies(lr_df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
lr_df = pd.get_dummies(lr_df, columns=['flat_model'], prefix=['model'])
lr_df= lr_df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline

Region and flat_model were dummy encoded, with Central region and Standard model selected as the baseline to which other classes are compared to.
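With `drop_first=True`, pandas drops the first category in sorted order, which is how Central becomes the baseline here ('Central' sorts before 'East', 'North', 'North East' and 'West'). A small illustration:

```python
import pandas as pd

regions = pd.Series(['West', 'Central', 'East'], name='region')
dummies = pd.get_dummies(regions, prefix='region', drop_first=True)
print(list(dummies.columns))  # ['region_East', 'region_West']
# The 'Central' row is all zeros: it is absorbed into the intercept as the baseline.
```

For flat_model the baseline is instead chosen explicitly by dropping `model_Standard` after encoding.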

4.5. Feature Scaling

Scaling is done only for the linear regression. Tree-based models do not require it, as scaling does not affect their performance.

In [90]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit to continuous columns and transform
scaled_columns = ['floor_area_sqm','lease_commence_date','school_dist','hawker_dist','num_hawker_2km','park_dist',
                    'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist']
scaler.fit(lr_df[scaled_columns])
scaled_columns = pd.DataFrame(scaler.transform(lr_df[scaled_columns]), index=lr_df.index, columns=scaled_columns)

# separate unscaled features
unscaled_columns = lr_df.drop(scaled_columns, axis=1)

# concatenate scaled and unscaled features
lr_df = pd.concat([scaled_columns,unscaled_columns], axis=1)
In [91]:
display(lr_df.head())
floor_area_sqm lease_commence_date school_dist hawker_dist num_hawker_2km park_dist num_park_2km mall_dist num_mall_2km mrt_dist ... real_price region_East region_North region_North East region_West model_Apartment model_Maisonette model_Model A model_New Generation model_Special
0 0.434709 -0.637010 -0.173252 -0.584115 -0.481922 -0.483176 -0.183114 -1.174579 0.329413 -0.920217 ... 470763.371323 1 0 0 0 0 0 1 0 0
1 0.227901 -0.553487 0.112022 0.343105 -0.838824 0.416079 0.104527 1.881664 -0.309844 -0.199828 ... 368625.028643 0 0 0 1 0 0 1 0 0
2 -0.971581 -0.720534 -0.943631 -0.948777 0.588787 -0.171706 -0.758398 0.611078 -0.629472 -1.014584 ... 253396.280952 0 0 0 1 0 0 1 0 0
3 -0.971581 -0.553487 0.032932 -1.168598 0.231884 0.814358 -0.758398 0.258249 -0.629472 0.351427 ... 263833.992095 0 0 0 1 0 0 1 0 0
4 -0.971581 -0.637010 0.148990 -0.897691 0.588787 1.239501 -0.902218 1.571630 -0.949100 0.488672 ... 261857.707510 0 0 0 1 0 0 1 0 0

5 rows × 25 columns

4.6. Outlier Detection

Cook's Distance is used here because the Mahalanobis distance required too much memory for this large dataset. The threshold used for Cook's Distance is 4/n.
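For reference, Cook's distance for observation i combines its residual and leverage: D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)², where h_ii is the i-th diagonal of the hat matrix, p the number of parameters and s² the residual mean square. A minimal NumPy sketch of that formula (yellowbrick's `CooksDistance` is used in the cells below; the explicit hat matrix here is only feasible for small n):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit (X must include a constant column)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (O(n^2) memory: small n only)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    h = np.diag(H)
    return resid**2 / (p * mse) * h / (1 - h) ** 2
```

Points with D_i above the 4/n cutoff are treated as influential and dropped.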

In [121]:
from yellowbrick.regressor import CooksDistance

lr_y = lr_df[['real_price']]
lr_X = lr_df.drop(['real_price','town'], axis=1)

yy = np.log(lr_y)['real_price']
XX = lr_X.values

visualizer = CooksDistance()
visualizer.fit(XX, yy)
#visualizer.show()
plt.close()

In [119]:
from yellowbrick.regressor import ResidualsPlot

# visualize residuals before outlier removal
model = LinearRegression()
visualizer_residuals = ResidualsPlot(model)
visualizer_residuals.fit(XX, yy)
#visualizer_residuals.show()
plt.close()

In [120]:
# remove outliers
i_less_influential = (visualizer.distance_ <= visualizer.influence_threshold_)
X_li, y_li = XX[i_less_influential], yy[i_less_influential]
lr_X, lr_y = lr_X[i_less_influential], lr_y[i_less_influential]

# visualize residuals after outliers removal
model = LinearRegression()
visualizer_residuals = ResidualsPlot(model)
visualizer_residuals.fit(X_li, y_li)
#visualizer_residuals.show()
plt.close()

After removing outliers (~5,000 points, 5.24%), the homoscedasticity improved. The residuals are normally distributed around 0, satisfying the linearity and normality assumptions of the linear model.

5. Linear Regression

5.1. Model Building & Results

In [98]:
from sklearn.linear_model import LinearRegression

# sklearn method, which doesn't give much additional info

lin_reg = LinearRegression()
lin_reg.fit(lr_X, np.log(lr_y))

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(lr_X, np.log(lr_y))}')
Coefficients: [[ 0.19011897  0.1184489   0.00854617 -0.07430303  0.0125737  -0.00978343
   0.03165512 -0.01031053 -0.00650099 -0.03773027  0.00599228 -0.01234415
   0.05427176  0.02554649 -0.17040042 -0.33082356 -0.2190222  -0.24102685
   0.02338675  0.06804495  0.03834279  0.07504922  0.16429196]]
Intercept: [12.8927495]
R^2 score: 0.896423670958943
In [99]:
# statsmodel method, which gives more info
import statsmodels.api as sm
# alternate way using statistical formula, which does not require dummy coding manually
# https://stackoverflow.com/questions/50733014/linear-regression-with-dummy-categorical-variables
# https://stackoverflow.com/questions/34007308/linear-regression-analysis-with-string-categorical-features-variables

X_constant = sm.add_constant(lr_X)
lin_reg = sm.OLS(np.log(lr_y),X_constant).fit()
lin_reg.summary()
Out[99]:
OLS Regression Results
Dep. Variable: real_price R-squared: 0.896
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 3.612e+04
Date: Wed, 11 Nov 2020 Prob (F-statistic): 0.00
Time: 12:38:19 Log-Likelihood: 87514.
No. Observations: 96016 AIC: -1.750e+05
Df Residuals: 95992 BIC: -1.748e+05
Df Model: 23
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 12.8927 0.003 4021.957 0.000 12.886 12.899
floor_area_sqm 0.1901 0.001 165.517 0.000 0.188 0.192
lease_commence_date 0.1184 0.001 236.867 0.000 0.117 0.119
school_dist 0.0085 0.000 24.011 0.000 0.008 0.009
hawker_dist -0.0743 0.001 -118.001 0.000 -0.076 -0.073
num_hawker_2km 0.0126 0.001 22.400 0.000 0.011 0.014
park_dist -0.0098 0.000 -26.387 0.000 -0.011 -0.009
num_park_2km 0.0317 0.000 67.091 0.000 0.031 0.033
mall_dist -0.0103 0.000 -26.800 0.000 -0.011 -0.010
num_mall_2km -0.0065 0.000 -14.196 0.000 -0.007 -0.006
mrt_dist -0.0377 0.000 -93.404 0.000 -0.039 -0.037
num_mrt_2km 0.0060 0.001 9.340 0.000 0.005 0.007
supermarket_dist -0.0123 0.000 -35.495 0.000 -0.013 -0.012
flat_type 0.0543 0.001 41.128 0.000 0.052 0.057
storey_range 0.0255 0.000 133.886 0.000 0.025 0.026
region_East -0.1704 0.001 -130.824 0.000 -0.173 -0.168
region_North -0.3308 0.002 -218.911 0.000 -0.334 -0.328
region_North East -0.2190 0.001 -155.558 0.000 -0.222 -0.216
region_West -0.2410 0.001 -170.033 0.000 -0.244 -0.238
model_Apartment 0.0234 0.001 20.271 0.000 0.021 0.026
model_Maisonette 0.0680 0.002 32.260 0.000 0.064 0.072
model_Model A 0.0383 0.001 39.651 0.000 0.036 0.040
model_New Generation 0.0750 0.001 61.867 0.000 0.073 0.077
model_Special 0.1643 0.003 54.791 0.000 0.158 0.170
Omnibus: 609.584 Durbin-Watson: 1.785
Prob(Omnibus): 0.000 Jarque-Bera (JB): 470.754
Skew: 0.085 Prob(JB): 5.99e-103
Kurtosis: 2.702 Cond. No. 42.0


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Since the sample is huge, most variables/features will be statistically significant even though they may not be practically significant.
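Since the outcome is log(price), a coefficient b translates to roughly a 100·(e^b − 1)% change in price. For the dummy features this reads off directly (e.g. region_North implies prices about 28% below the Central baseline, other features held fixed); for the standardized continuous features it is the effect of a one-standard-deviation increase:

```python
from math import exp

def pct_effect(coef):
    """Percent change in price implied by a coefficient in a log-price regression."""
    return 100 * (exp(coef) - 1)

print(f"region_North:  {pct_effect(-0.3308):+.1f}%")  # about -28.2%
print(f"model_Special: {pct_effect(0.1643):+.1f}%")   # about +17.9%
```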

In [108]:
# scatterplot of y (observed) and yhat (predicted)

plt.style.use('default')
fig = plt.figure(figsize=(5,3))
ax = sns.scatterplot(x=np.log(lr_y)['real_price'], y=lin_reg.predict(), edgecolors='w', alpha=0.9, s=8)
ax.set_xlabel('Observed')#, ax.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_xticks()/1000])
ax.set_ylabel('Predicted')#, ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000])
ax.annotate('Adjusted R\u00b2: ' + str(format(round(lin_reg.rsquared_adj,2),'.2f')), xy=(0, 1), xytext=(25, -25),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
plt.close()

The selected features are able to account for 90% of the variance in HDB resale prices.

5.2. Homoscedasticity and Normality of Residuals

In [110]:
# Homoscedasticity and Normality of Residuals
pred = lin_reg.predict()
resids = lin_reg.resid
resids_studentized = lin_reg.get_influence().resid_studentized_internal

fig = plt.figure(figsize=(10,3))

ax1 = plt.subplot(121)
sns.scatterplot(x=pred, y=resids_studentized, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Studentized Residuals')

ax2 = plt.subplot(122)
sns.distplot(resids_studentized, norm_hist=True, hist_kws=dict(edgecolor='w'))
ax2.set_xlabel('Studentized Residual')

plt.close()

Homoscedasticity appears to be satisfied. The residuals are normally distributed around 0, satisfying the linearity and normality assumptions of the linear model.

5.3. Feature Importance

In [112]:
# Feature Importance

lr_results = pd.read_html(lin_reg.summary().tables[1].as_html(),header=0,index_col=0)[0]
coefs = lr_results[['coef']][1:].reset_index().rename(columns={'index':'feature'})
coefs['feature_importance'] = np.abs(coefs['coef'])
coefs = coefs.sort_values('feature_importance').reset_index(drop=True)
coefs['color'] = coefs['coef'].apply(lambda x: '#66ff8c' if x>0 else '#ff8c66')
coefs.plot.barh(x='feature',y='feature_importance',color=coefs['color'],figsize=(4,5))
colors = {'Positive':'#66ff8c', 'Negative':'#ff8c66'}         
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
legend = plt.legend(handles, labels, title='Relationship', fontsize = '15')
plt.setp(legend.get_title(),fontsize='17')
plt.xlabel('Standardized Coefficients', size=15), plt.ylabel('Features', size=15)
plt.ylim([-1,23])
plt.title('Linear Regression Feature Importance', size=15)
plt.show()

It seems that region is the feature that drives resale prices the most. The Central region was used as the baseline against which all other regions are compared, and flats in all other regions tend to sell for less than those in the Central region. Floor area, lease commence date and special flat models are also positive drivers of resale prices, while distance to the nearest hawker centre is a negative driver.

Let's look at whether a non-linear model will show consistent drivers for resale prices.

6. Random Forest

In [62]:
from sklearn.model_selection import train_test_split

# Train Test Split
y = df[['real_price']]
X = df.drop(['real_price','town', 'year'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1, shuffle=True, random_state=0)
print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)
Shape of X_train: (91195, 26)
Shape of X_test: (10133, 26)
Shape of y_train: (91195, 1)
Shape of y_test: (10133, 1)

Here, I used a ratio of 9:1 for the train and test set and used both Out-Of-Bag and K-fold Cross Validation as validation methods.
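The out-of-bag score works because each bootstrap sample used to grow a tree leaves out a fraction (1 − 1/n)^n ≈ e⁻¹ ≈ 36.8% of the training rows, and those held-out rows act as a built-in validation set for that tree:

```python
from math import exp

n = 91195  # training rows from the split above
p_out = (1 - 1 / n) ** n
print(f"fraction of rows out-of-bag per tree: {p_out:.4f}")  # ~0.3679, i.e. ~1/e
```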

6.1. Out-Of-Bag

In [63]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from scipy.stats import spearmanr, pearsonr

# Validation using out-of-bag method
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
predicted_train = rf.predict(X_train)

print(f'Out-of-bag R\u00b2 score estimate: {rf.oob_score_:>5.3}')
Out-of-bag R² score estimate: 0.957
In [64]:
# predict and get evaluation metrics on test set
predicted_test = rf.predict(X_test)
oob_test_score = r2_score(y_test['real_price'], predicted_test)
spearman = spearmanr(y_test['real_price'], predicted_test)
pearson = pearsonr(y_test['real_price'], predicted_test)
oob_mae = mean_absolute_error(y_test['real_price'], predicted_test)

print(f'Test data R\u00b2 score: {oob_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(oob_mae)}')
Test data R² score: 0.958
Test data Spearman correlation: 0.976
Test data Pearson correlation: 0.979
Test data Mean Absolute Error: 21328.0

6.2. K-fold Cross Validation

In [65]:
from sklearn.model_selection import GridSearchCV

# validation by k-fold cross validation with grid search for best hyperparameters
# hyperparameter values shown below are the tuned final values
param_grid = {
    'max_features': ['auto'], # max number of features considered for splitting a node
    'max_depth': [20], # max number of levels in each decision tree
    'min_samples_split': [15], # min number of data points placed in a node before the node is split
    'min_samples_leaf': [2]} # min number of data points allowed in a leaf node
rfr =GridSearchCV(RandomForestRegressor(n_estimators = 500, n_jobs=-1, random_state=0),
                        param_grid, cv=10, scoring='r2', return_train_score=True)
rfr.fit(X_train,y_train)
print("Best parameters set found on Cross Validation:\n\n", rfr.best_params_)
print("\nCross Validation R\u00b2 score:\n\n", rfr.best_score_.round(3))
Best parameters set found on Cross Validation:

 {'max_depth': 20, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 15}

Cross Validation R² score:

 0.961
In [66]:
# predict and get evaluation metrics for test set

cv_predicted_test = rfr.predict(X_test)

cv_test_score = r2_score(y_test['real_price'], cv_predicted_test)
spearman = spearmanr(y_test['real_price'], cv_predicted_test)
pearson = pearsonr(y_test['real_price'], cv_predicted_test)
cv_mae = mean_absolute_error(y_test['real_price'], cv_predicted_test)

print(f'Test data R\u00b2 score: {cv_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(cv_mae)}')
Test data R² score: 0.962
Test data Spearman correlation: 0.979
Test data Pearson correlation: 0.981
Test data Mean Absolute Error: 20243.0
In [67]:
# scatterplots of y (observed) and yhat (predicted)

fig = plt.figure(figsize=(13,4))

ax1 = plt.subplot(121)
ax1 = sns.scatterplot(x=y_test['real_price'], y=predicted_test, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Observed'), ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()/1000])
ax1.set_ylabel('Predicted'), ax1.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_yticks()/1000])
ax1.annotate('Test R\u00b2: ' + str(round(oob_test_score,3)) + '\nTest MAE: ' + str(round(oob_mae)), xy=(0, 1), xytext=(25, -35),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax1.set_title('Tuned Using Out-Of-Bag')

ax2 = plt.subplot(122)
ax2 = sns.scatterplot(x=y_test['real_price'], y=cv_predicted_test, edgecolors='w', alpha=0.9, s=8)
ax2.set_xlabel('Observed'), ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()/1000])
ax2.set_ylabel('Predicted'), ax2.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000])
ax2.annotate('Test R\u00b2: ' + str(round(cv_test_score,3)) + '\nTest MAE: ' + str(round(cv_mae)), xy=(0, 1), xytext=(25, -35),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax2.set_title('Tuned Using Cross Validation')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Random Forest appears to perform better than linear regression in terms of R².

6.3. Feature Importance

In [68]:
fig = plt.figure(figsize=(14,7))

ax1 = plt.subplot(121)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rf.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
#plt.xticks(rotation=45, ha='right')
ax1.set_title('OOB Feature Importance', size=15)

ax2 = plt.subplot(122)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rfr.best_estimator_.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
ax2.set_title('CV Feature Importance', size=15)

plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
fig.show()

The feature importances are a bit different from the linear regression model. Floor area and lease commencement date are still among the main drivers of resale prices. However, features like distance from Dhoby Ghaut MRT and flat type also appear to be strong drivers.

Tree-based models seem to give lower importance to categorical variables. This is because the impurity-based importance score measures how often a feature is selected for splitting and how much gain in purity those splits achieve, which favours continuous features with many candidate split points.
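One way to sanity-check the impurity-based ranking is permutation importance, which shuffles each feature in turn and measures the resulting drop in R²; it is less biased toward continuous features. A sketch on synthetic stand-in data (the column names are illustrative, not the notebook's actual feature set):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# synthetic data: price driven mostly by floor area, a little by lease year,
# and not at all by the encoded flat type
rng = np.random.default_rng(0)
X = pd.DataFrame({'floor_area': rng.uniform(30, 150, 500),
                  'lease_year': rng.integers(1970, 2000, 500).astype(float),
                  'flat_type': rng.integers(0, 5, 500).astype(float)})
y = X['floor_area'] * 2000 + (X['lease_year'] - 1970) * 1000 \
    + rng.normal(0, 5000, 500)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# shuffle each column n_repeats times; the mean drop in R^2 is its importance
perm = permutation_importance(model, X, y, n_repeats=5,
                              scoring='r2', random_state=0)
for name, imp in sorted(zip(X.columns, perm.importances_mean),
                        key=lambda t: -t[1]):
    print(f'{name}: {imp:.3f}')
```

In practice this would be run on the held-out `X_test`/`y_test` with the fitted forest, and compared against `feature_importances_`.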

6.4. SHAP Values

We can also look at feature importance using SHAP (SHapley Additive exPlanations) values, first proposed by Lundberg and Lee (2017), for interpreting any machine learning model. SHAP values have a few advantages:

  1. Directionality — unlike the feature importance from random forest, SHAP values show both the importance of a feature and the direction in which it influences the outcome.
  2. Global interpretability — the collective SHAP values can show how much each predictor contributes across the entire dataset (not shown here, as it takes a long time on a large dataset).
  3. Local interpretability — each observation gets its own set of SHAP values, allowing us to identify which features matter most for that individual observation.
  4. SHAP values can be computed efficiently and exactly for any tree-based model, which many other attribution methods cannot do.

Below shows the contributors to resale price for an example of low, middle and high resale price flats.

Flats with predicted low resale price

In [69]:
import shap
shap.initjs()

explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[16]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[16]])
Out[69]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [70]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[5]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[5]])
Out[70]:

Flats with predicted medium resale price

In [71]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[1]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[1]])
Out[71]:
In [72]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[100]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[100]])
Out[72]:

Flats with predicted high resale price

In [73]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[172]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[172]])
Out[73]:
In [74]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[10084]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[10084]])
Out[74]:
In [75]:
print("Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4")
Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4

7. Conclusion

In this project, linear regression and random forest were used to examine the drivers of HDB resale prices. Linear regression is powerful because its coefficients make the model directly interpretable for every feature. However, it assumes a linear relationship between the features and the outcome, which isn't always the case in real life, and it tends to suffer from bias due to its parametric nature. Conversely, non-parametric methods do not assume any functional form, and random forest is a powerful non-linear machine learning model that uses bootstrap aggregating (bagging) and ensembling. A single decision tree has high variance as it tends to overfit the data; by bagging and ensembling many trees, the random forest reduces that variance.

Looking at the output of the models, linear regression showed that region, floor area, flat model, lease commencement date and distance from the nearest hawker centre are the top five drivers of HDB prices. Random forest gave a slightly different result: floor area, lease commencement date and distance from the nearest hawker centre are still in the top five, while distance from Dhoby Ghaut MRT and flat type also came up on top. This happens because tree-based models tend to give lower importance to categorical variables (region and flat model) due to the way importance is computed.

Nevertheless, the size of the flat, lease date, and certain aspects of the location appear to be consistently the most important drivers of HDB resale prices.

Things to improve on:

  • Remove lease commencement date from the linear regression model and rerun it, as it still has a high VIF, which could lead to poor estimates of the model's coefficients. Different subsets of features could also be dropped and checked for remaining high VIF, rerunning the linear regression each time to find the best feature set. This would show whether features that are important in the random forest (e.g., distance from Dhoby Ghaut MRT) are also important under linear regression.
  • Some of the amenities might not yet have been built when a flat was sold. For example, many Downtown Line stations (e.g., Tampines West, Bedok North, Fort Canning) only opened in 2017, so flats sold in 2015 to 2016 were not affected by them. A better approach would take opening dates into account when computing the nearest distance to amenities and the number of amenities. The difference might be small, but it is definitely a cleaner approach.
  • Account for the number of Build-To-Order (BTO) flats offered each year.
  • Compute travelling time instead of the Euclidean distance of flats from key locations.
  • Try out different ways to recategorise flat models, as the current grouping was arbitrary, and see whether that changes the results of the models.
  • Many other factors affect the price of an HDB flat, some harder to procure than others. The condition/cleanliness of the block, the neighbours, the direction the flat faces, the availability of BTO flats, and changes in government policies are all examples of important factors to consider.
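The VIF check mentioned in the first point can be sketched directly: VIF_i = 1 / (1 − R²_i), where R²_i comes from regressing feature i on all the others (values above roughly 5–10 usually signal problematic collinearity). Synthetic columns stand in for the notebook's actual features here:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data with deliberate collinearity between two columns
rng = np.random.default_rng(0)
df = pd.DataFrame({'floor_area': rng.uniform(30, 150, 300)})
df['lease_commence_date'] = 1970 + df['floor_area'] * 0.2 \
    + rng.normal(0, 2, 300)             # strongly collinear with floor_area
df['dist_hawker'] = rng.uniform(0, 3, 300)  # independent

def vif(frame):
    """VIF_i = 1 / (1 - R^2_i), regressing column i on the rest."""
    out = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

vifs = vif(df)
print(vifs.round(2))
```

Running this on the actual design matrix before refitting the linear regression would make the "drop and recheck" loop in the first point systematic rather than ad hoc.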
In [ ]: